Name: Kaung Si Phyo

Class: EL/EP0302/FT/04

Title: An In-depth Exploration of Music Trends, Genre Preferences, and Cultural Influences in the Evolving Music Landscape

Questions

  • How has the popularity of music tracks changed over the years, comparing the period from 1950 to 1980 with the more recent period from 1980 to 2020? By analyzing the line plots of total play counts for each year in both time periods, what trends or shifts in music preferences can be observed? Additionally, do these trends align with historical events or musical movements that may have influenced listeners' preferences?
  • How has the popularity of music genres evolved across two distinct eras, the post-war period (1950 to 1980) and the digital age (1981 to 2020), based on the datasets? Are there any significant shifts in genre popularity, and do these trends reflect broader cultural shifts or technological advancements that have influenced listeners' music choices?
  • Among the unique towns, which specific flat types command higher average prices? Are there variations in pricing based on the flat type within each town?
  • How have listeners' preferences in music genres evolved from the Post-War Era to the Digital Age, and what might be some key events or technological advancements during these periods that could explain the observed shifts?
  • Which music genres, excluding rock, experienced the highest growth in total playcount over the years 2000 to 2015 based on the provided dataset, and how does their growth compare to the total playcount of the 'rock' genre during the same period?
  • Are there any significant relationships between the audio features of tracks (e.g., danceability, energy) and their chart performance or listener counts?
  • How does the distribution of normalized log scrobblescores vary among different music tags in the Last.fm dataset, and which tags exhibit the highest variance in scrobblescore popularity among artists?
  • How does the relationship between the normalized listeners and scrobbles on Last.fm influence the popularity and longevity of an artist within specific music genres?
  • To what extent do various factors, such as the total released amount of music, the count of releases by year, and the attributes of different music genres, contribute to the popularity of artists as measured by normalized log scrobblescores in the Last.fm dataset? Which factors seem to have the most significant impact on an artist's popularity, and are there any notable genre-specific trends that influence an artist's fame?
  • Is there a significant correlation between music attributes, such as danceability, energy, and tempo, and the play count or popularity of music tracks? Do tracks with certain attributes tend to attract more listeners and achieve higher play counts compared to others?
  • How does the cultural context of different countries influence the music attributes present in popular tracks? Are there genre preferences, thematic elements, or stylistic choices that align with specific cultural backgrounds, and do these attributes contribute to the popularity of music within those regions?

Dataset Links

musicInfo Dataset and user listening count dataset

artist

Dataset: Music Info

Nature of the dataset:


This dataset contains information related to music tracks from a rebuilt version and subset of The Million Song Dataset. It was built from lastfm-spotify-tags-sim-userdata, The Echo Nest Taste Profile Subset, lastfm-dataset-2020, tagtraum genre annotations, and the Spotify API. It consists of the following columns:

  • track_id: A unique identifier for each music track.
  • name: The name of the track.
  • artist: The name of the artist who performed the track.
  • spotify_preview_url: The preview URL of the track from Spotify.
  • spotify_id: The unique identifier of the track on Spotify.
  • tags: Tags associated with the track.
  • genre: The genre of the track.
  • year: The year of release of the track.
  • duration_ms: The duration of the track in milliseconds.
  • danceability: A numerical value indicating the danceability of the track.
  • energy: A numerical value indicating the energy level of the track.
  • key: The key in which the track is written.
  • loudness: A numerical value indicating the loudness of the track.
  • mode: A binary value indicating the modality of the track (1 = major, 0 = minor).
  • speechiness: A numerical value indicating the presence of spoken words in the track.
  • acousticness: A numerical value indicating the acousticness of the track.
  • instrumentalness: A numerical value indicating the instrumentalness of the track.
  • liveness: A numerical value indicating the presence of a live audience in the track.
  • valence: A numerical value indicating the musical positiveness of the track.
  • tempo: A numerical value indicating the tempo of the track.
  • time_signature: The time signature of the track.

Peculiarities:

This dataset contains missing values in the 'tags' and 'genre' columns. Additionally, the 'genre' column includes inconsistent labels for music genres (e.g., 'Rock' and 'RnB'). The dataset also includes various audio features (e.g., danceability, energy, loudness) represented by numerical values.
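The peculiarities above can be checked with a few lines of pandas. The following is a minimal sketch on a toy frame: the rows are made up for illustration, and the `genre_norm` column name is an assumption, not part of the original dataset.

```python
import pandas as pd

# Toy stand-in for musicInfo.csv; the values are illustrative only.
tracks = pd.DataFrame({
    "track_id": ["T1", "T2", "T3", "T4"],
    "tags": ["rock, indie", None, "pop", "rap, hip_hop"],
    "genre": ["Rock", None, "Pop", "RnB"],
})

# Count missing values per column.
missing_counts = tracks.isna().sum()

# Lower-case the genre labels so they compare consistently with lower-case tags.
# 'genre_norm' is a hypothetical helper column.
tracks["genre_norm"] = tracks["genre"].str.lower()

print(missing_counts)
print(tracks["genre_norm"].tolist())
```

Keeping the normalized labels in a separate column leaves the original 'genre' values available for comparison.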

Process of Analysis:

1. Data Loading: The first step in the analysis process involved loading the 'musicInfo.csv' file into a DataFrame using the pandas library.

2. Data Cleaning: The dataset was inspected for missing values and inconsistencies in the 'genre' column. Missing values in the 'tags' column may be handled based on the analysis requirements.

3. Data Exploration: Basic exploratory data analysis (EDA) techniques were applied to understand the structure and data types of the dataset. Summary statistics and visualizations might be used to gain insights into the distribution of track features.

4. Data Analysis: The dataset may be analyzed to identify trends in track features, such as danceability, energy, or acousticness, across different genres or years.

5. Data Visualization: Visualizations like histograms, box plots, or scatter plots could be utilized to represent the distribution and relationships between various audio features and other attributes of the tracks.

6. Insights and Conclusions:

Analysis of the audio features and genres can provide insights into music preferences, genre popularity, and how specific musical characteristics impact the tracks' reception. Understanding the relationships between these features can help music platforms and artists tailor their offerings to cater to user demands and improve user engagement.
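As a rough illustration of the analysis step described above, the sketch below averages two audio features per genre. The data is a made-up toy frame, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for the cleaned Music Info data; values are illustrative only.
tracks = pd.DataFrame({
    "genre": ["Rock", "Rock", "Pop", "Pop"],
    "year": [1995, 2005, 1995, 2005],
    "danceability": [0.40, 0.50, 0.70, 0.80],
    "energy": [0.90, 0.80, 0.60, 0.70],
})

# Mean of each audio feature per genre.
feature_means = tracks.groupby("genre")[["danceability", "energy"]].mean()

print(feature_means)
```

The same pattern should extend to grouping by 'year' (or by both columns) to look at how a feature moves over time.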

Dataset: User Listening Counts

Nature of the dataset:


This dataset contains information related to music tracks and user interactions from a rebuilt version and subset of The Million Song Dataset. It was built from lastfm-spotify-tags-sim-userdata, The Echo Nest Taste Profile Subset, lastfm-dataset-2020, tagtraum genre annotations, and the Spotify API. It consists of the following columns:

  • track_id: A unique identifier for each music track.
  • user_id: A unique identifier for each user who interacted with the music tracks.
  • playcount: An integer column indicating the number of times each track has been played by users.

Peculiarities:

One peculiarity of this dataset is that it has no 'genre' column of its own: genre information, including its missing values (NaN), only appears after merging with the Music Info dataset on 'track_id'. Additionally, the dataset may contain duplicate records for some music tracks or users, which could impact the analysis.

Process of Analysis:

1. Data Loading: The first step in the analysis process involved loading the 'userListeningHistory.csv' file into a DataFrame using the pandas library.

2. Data Cleaning: The dataset was inspected for missing values, duplicate records, and unnecessary columns. Missing 'genre' values, which appear once the data is merged with the Music Info dataset, may be dropped or filled based on the analysis requirements.

3. Data Exploration: Basic exploratory data analysis (EDA) techniques were applied to gain insights into the dataset's structure, data types, and memory usage. Summary statistics and visualizations were used to understand the distribution of playcounts and identify potential patterns.

4. Handling Missing Values: If the 'genre' column was required for the analysis, the missing values could be filled using appropriate methods like forward fill, backward fill, or mode imputation.

5. Handling Duplicates: If duplicate records were identified, appropriate actions could be taken, such as dropping duplicates or aggregating the data based on specific criteria.

6. Data Visualization: Further data visualization techniques like histograms, box plots, or scatter plots might be used to visualize the distribution and relationships of the data.

7. Analysis Insights: The analysis aimed to uncover insights into music preferences, track popularity, and genre trends. Understanding the impact of music attributes on playcounts and popularity could help optimize music content for a larger audience.
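The duplicate-handling and missing-value steps above can be sketched as follows. This is a toy example under stated assumptions: the frame stands in for listening records already merged with genre information, duplicates are aggregated by summing 'playcount', and mode imputation is used only as one of the options mentioned (it is crude and can inflate the most common genre).

```python
import pandas as pd

# Toy stand-in for listening records merged with genre; values are illustrative.
plays = pd.DataFrame({
    "track_id": ["T1", "T1", "T2", "T3"],
    "user_id": ["U1", "U1", "U1", "U2"],
    "playcount": [2, 3, 1, 5],
    "genre": ["Rock", "Rock", None, "Rock"],
})

# Aggregate duplicate (track, user) pairs by summing their playcounts.
deduped = plays.groupby(["track_id", "user_id"], as_index=False).agg(
    playcount=("playcount", "sum"),
    genre=("genre", "first"),
)

# Mode imputation: fill missing genres with the most frequent label.
most_common_genre = deduped["genre"].mode().iloc[0]
deduped["genre"] = deduped["genre"].fillna(most_common_genre)

print(deduped)
```

Dropping rows with a missing genre, as the notebook does with `dropna`, avoids that bias at the cost of discarding data.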

Conclusion:

This dataset provides valuable information about music track playcounts and user interactions. Analyzing this dataset can reveal patterns and trends in music preferences, helping music platforms tailor their offerings to meet user demands and improve user engagement. It is crucial to handle missing values and duplicates appropriately to ensure accurate and meaningful analysis results.

Dataset: Artists

Nature of the dataset:


This dataset contains information related to artists and listener activity from MusicBrainz and Last.fm. It consists of the following columns:

  • mbid: The MusicBrainz identifier, unique for each artist.
  • artist_mb: The artist's name from MusicBrainz, a music database.
  • artist_lastfm: The artist's name from Last.fm, a music platform.
  • country_mb: The country associated with the artist in MusicBrainz.
  • country_lastfm: The country associated with the artist in Last.fm.
  • tags_mb: Tags associated with the artist from MusicBrainz (semicolon-separated).
  • tags_lastfm: Tags associated with the artist from Last.fm (semicolon-separated).
  • listeners_lastfm: The number of Last.fm users who have listened to the artist.
  • scrobbles_lastfm: The number of times the artist's tracks have been played (scrobbled) by Last.fm users.
  • ambiguous_artist: A boolean indicating if the artist's name is ambiguous.

Peculiarities:

One peculiarity of this dataset is that it contains missing values (NaN) in every column except 'mbid' and 'ambiguous_artist', most heavily in 'country_lastfm', 'tags_mb', and 'tags_lastfm'. Additionally, it includes a boolean column 'ambiguous_artist' that flags whether an artist's name is ambiguous.

Process of Analysis:

1. Data Loading: The first step in the analysis process involved loading the 'artists.csv' file into a DataFrame using the pandas library.

2. Data Cleaning: The dataset was inspected for missing values and unnecessary columns. Depending on the analysis requirements, missing values in relevant columns could be handled by either dropping the rows or filling the missing values with appropriate methods.

3. Data Exploration: Basic exploratory data analysis (EDA) techniques were applied to understand the structure and data types of the dataset. Summary statistics and visualizations might be used to gain insights into the distribution of playcounts and listener counts.

4. Data Visualization: Further data visualization techniques like histograms, box plots, or scatter plots might be used to visualize the distribution and relationships of the data, such as comparing playcounts with listeners, or analyzing the popularity of different tags.

5. Analysis Insights: The analysis aims to uncover insights into music preferences, track popularity, and the impact of various factors on playcounts and scrobbles. Understanding the relationship between artists, countries, and music tags can help optimize music content and improve user engagement on the platform.

Conclusion:

This dataset provides valuable information about music tracks, artists, and user interactions on a music platform. Analyzing this dataset can reveal patterns and trends in music preferences, helping to optimize music content and improve the overall user experience. Proper handling of missing values and consideration of the 'ambiguous_artist' column will ensure accurate and meaningful analysis results.
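One possible way to act on the two points above (missing values and the 'ambiguous_artist' flag) is sketched below on a toy frame. Excluding ambiguous artists and lower-casing tags are assumptions made for this illustration, not steps taken in the notebook.

```python
import pandas as pd

# Toy stand-in for artists.csv; values are illustrative only.
artists_toy = pd.DataFrame({
    "artist_lastfm": ["A", "B", "C"],
    "tags_lastfm": ["rock; pop; british", "rap; Hip-Hop", None],
    "ambiguous_artist": [False, False, True],
})

# Keep unambiguous artists that have at least one Last.fm tag.
clean = artists_toy[~artists_toy["ambiguous_artist"]].dropna(subset=["tags_lastfm"])

# One row per (artist, tag). Splitting on ';' leaves leading spaces,
# so each tag is stripped (and lower-cased) afterwards.
tag_rows = clean.assign(tag=clean["tags_lastfm"].str.split(";")).explode("tag")
tag_rows["tag"] = tag_rows["tag"].str.strip().str.lower()

print(tag_rows[["artist_lastfm", "tag"]])
```

Stripping matters because the tag strings are separated by "; ", so every tag after the first would otherwise start with a space.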

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [39]:
# Read musicInfo.csv
musicInfo = pd.read_csv("musicInfo.csv")
print(musicInfo.info())
print(musicInfo.head(10))

# Drop rows with missing values in musicInfo DataFrame
musicInfo.dropna(inplace=True)

# Read userListeningHistory.csv
listeningHistory = pd.read_csv("userListeningHistory.csv")
print(listeningHistory.info())

# Drop rows with missing values in listeningHistory DataFrame
listeningHistory.dropna(inplace=True)

# Read artists.csv
artists = pd.read_csv("artists.csv", low_memory=False)
print(artists.info())
print(artists.head(10))

# Drop rows with missing values in artists DataFrame
artists.dropna(inplace=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50683 entries, 0 to 50682
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   track_id             50683 non-null  object 
 1   name                 50683 non-null  object 
 2   artist               50683 non-null  object 
 3   spotify_preview_url  50683 non-null  object 
 4   spotify_id           50683 non-null  object 
 5   tags                 49556 non-null  object 
 6   genre                22348 non-null  object 
 7   year                 50683 non-null  int64  
 8   duration_ms          50683 non-null  int64  
 9   danceability         50683 non-null  float64
 10  energy               50683 non-null  float64
 11  key                  50683 non-null  int64  
 12  loudness             50683 non-null  float64
 13  mode                 50683 non-null  int64  
 14  speechiness          50683 non-null  float64
 15  acousticness         50683 non-null  float64
 16  instrumentalness     50683 non-null  float64
 17  liveness             50683 non-null  float64
 18  valence              50683 non-null  float64
 19  tempo                50683 non-null  float64
 20  time_signature       50683 non-null  int64  
dtypes: float64(9), int64(5), object(7)
memory usage: 8.1+ MB
None
             track_id              name           artist  \
0  TRIOREW128F424EAF0    Mr. Brightside      The Killers   
1  TRRIVDJ128F429B0E8        Wonderwall            Oasis   
2  TROUVHL128F426C441   Come as You Are          Nirvana   
3  TRUEIND128F93038C4       Take Me Out  Franz Ferdinand   
4  TRLNZBD128F935E4D8             Creep        Radiohead   
5  TRUMISQ128F9340BEE  Somebody Told Me      The Killers   
6  TRVCCWR128F9304A30      Viva la Vida         Coldplay   
7  TRXOGZT128F424AD74      Karma Police        Radiohead   
8  TRMZXEW128F9341FD5     The Scientist         Coldplay   
9  TRUJIIV12903CA8848            Clocks         Coldplay   

                                 spotify_preview_url              spotify_id  \
0  https://p.scdn.co/mp3-preview/4d26180e6961fd46...  09ZQ5TmUG8TSL56n0knqrj   
1  https://p.scdn.co/mp3-preview/d012e536916c927b...  06UfBBDISthj1ZJAtX4xjj   
2  https://p.scdn.co/mp3-preview/a1c11bb1cb231031...  0keNu0t0tqsWtExGM3nT1D   
3  https://p.scdn.co/mp3-preview/399c401370438be4...  0ancVQ9wEcHVd0RrGICTE4   
4  https://p.scdn.co/mp3-preview/e7eb60e9466bc3a2...  01QoK9DA7VTeTSE3MNzp4I   
5  https://p.scdn.co/mp3-preview/0d07673cfb46218a...  0FNmIQ7u45Lhdn6RHhSLix   
6  https://p.scdn.co/mp3-preview/ab747fed1bfab2ac...  08A1lZeyLMWH58DT6aYjnC   
7  https://p.scdn.co/mp3-preview/5a09f5390e2862af...  01puceOqImrzSfKDAcd1Ia   
8  https://p.scdn.co/mp3-preview/95cb9df1b056d759...  0GSSsT9szp0rJkBrYkzy6s   
9  https://p.scdn.co/mp3-preview/24c7fe858b234e3c...  0BCPKOYdS2jbQ8iyB56Zns   

                                                tags genre  year  duration_ms  \
0  rock, alternative, indie, alternative_rock, in...   NaN  2004       222200   
1  rock, alternative, indie, pop, alternative_roc...   NaN  2006       258613   
2   rock, alternative, alternative_rock, 90s, grunge   RnB  1991       218920   
3  rock, alternative, indie, alternative_rock, in...   NaN  2004       237026   
4  rock, alternative, indie, alternative_rock, in...   RnB  2008       238640   
5  rock, alternative, indie, pop, alternative_roc...   NaN  2005       198480   
6  rock, alternative, indie, pop, alternative_roc...   NaN  2013       235384   
7  rock, alternative, indie, alternative_rock, in...   NaN  1996       264066   
8  rock, alternative, indie, pop, alternative_roc...  Rock  2007       311014   
9  rock, alternative, indie, pop, alternative_roc...   NaN  2002       307879   

   danceability  ...  key  loudness  mode  speechiness  acousticness  \
0         0.355  ...    1    -4.360     1       0.0746      0.001190   
1         0.409  ...    2    -4.373     1       0.0336      0.000807   
2         0.508  ...    4    -5.783     0       0.0400      0.000175   
3         0.279  ...    9    -8.851     1       0.0371      0.000389   
4         0.515  ...    7    -9.935     1       0.0369      0.010200   
5         0.508  ...   10    -4.289     0       0.0847      0.000087   
6         0.588  ...    8    -7.903     1       0.1050      0.153000   
7         0.360  ...    7    -9.129     1       0.0260      0.062600   
8         0.566  ...    5    -7.826     1       0.0242      0.715000   
9         0.577  ...    5    -7.215     0       0.0279      0.599000   

   instrumentalness  liveness  valence    tempo  time_signature  
0          0.000000    0.0971    0.240  148.114               4  
1          0.000000    0.2070    0.651  174.426               4  
2          0.000459    0.0878    0.543  120.012               4  
3          0.000655    0.1330    0.490  104.560               4  
4          0.000141    0.1290    0.104   91.841               4  
5          0.000643    0.0641    0.704  138.030               4  
6          0.000000    0.0634    0.520  137.973               4  
7          0.000092    0.1720    0.317   74.807               4  
8          0.000014    0.1200    0.173  146.365               4  
9          0.011500    0.1830    0.255  130.970               4  

[10 rows x 21 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9711301 entries, 0 to 9711300
Data columns (total 3 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   track_id   object
 1   user_id    object
 2   playcount  int64 
dtypes: int64(1), object(2)
memory usage: 222.3+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1466083 entries, 0 to 1466082
Data columns (total 10 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   mbid              1466083 non-null  object 
 1   artist_mb         1466075 non-null  object 
 2   artist_lastfm     986756 non-null   object 
 3   country_mb        662368 non-null   object 
 4   country_lastfm    211498 non-null   object 
 5   tags_mb           119946 non-null   object 
 6   tags_lastfm       381075 non-null   object 
 7   listeners_lastfm  986760 non-null   float64
 8   scrobbles_lastfm  986760 non-null   float64
 9   ambiguous_artist  1466083 non-null  bool   
dtypes: bool(1), float64(2), object(7)
memory usage: 102.1+ MB
None
                                   mbid              artist_mb  \
0  cc197bad-dc9c-440d-a5b5-d52ba2e14234               Coldplay   
1  a74b1b7f-71a5-4011-9441-d0b5e4122711              Radiohead   
2  8bfac288-ccc5-448d-9573-c33ea2aa5c30  Red Hot Chili Peppers   
3  73e5e69d-3554-40d8-8516-00cb38737a1c                Rihanna   
4  b95ce3ff-3d05-4e87-9e01-c97b66af13d4                 Eminem   
5  95e1ead9-4d31-4808-a7ac-32c3614c116b            The Killers   
6  164f0d73-1234-4e2c-8743-d77bf2191051             Kanye West   
7  5b11f4ce-a62d-471e-81fc-a69a8278c7da                Nirvana   
8  9c9f1380-2516-4fc9-a3e6-f9f61941d090                   Muse   
9  0383dadf-2a4e-4d10-a46a-e9e041da8eb3                  Queen   

           artist_lastfm      country_mb           country_lastfm  \
0               Coldplay  United Kingdom           United Kingdom   
1              Radiohead  United Kingdom           United Kingdom   
2  Red Hot Chili Peppers   United States            United States   
3                Rihanna   United States  Barbados; United States   
4                 Eminem   United States            United States   
5            The Killers   United States                      NaN   
6             Kanye West   United States            United States   
7                Nirvana   United States            United States   
8                   Muse  United Kingdom           United Kingdom   
9                  Queen  United Kingdom           United Kingdom   

                                             tags_mb  \
0  rock; pop; alternative rock; british; uk; brit...   
1  rock; electronic; alternative rock; british; g...   
2  rock; alternative rock; 80s; 90s; rap; metal; ...   
3  pop; dance; hip hop; reggae; contemporary r b;...   
4  turkish; rap; american; hip-hop; hip hop; hiph...   
5  synthpop; alternative rock; american; new wave...   
6  synthpop; pop; american; hip-hop; hip hop; ele...   
7  rock; alternative rock; 90s; punk; american; e...   
8  rock; electronic; synthpop; alternative rock; ...   
9  rock; progressive rock; 70s; 80s; 90s; pop-roc...   

                                         tags_lastfm  listeners_lastfm  \
0  rock; alternative; britpop; alternative rock; ...         5381567.0   
1  alternative; alternative rock; rock; indie; el...         4732528.0   
2  rock; alternative rock; alternative; Funk Rock...         4620835.0   
3  pop; rnb; female vocalists; dance; Hip-Hop; Ri...         4558193.0   
4  rap; Hip-Hop; Eminem; hip hop; pop; american; ...         4517997.0   
5  indie; rock; indie rock; alternative; alternat...         4428868.0   
6  Hip-Hop; rap; hip hop; rnb; Kanye West; seen l...         4390502.0   
7  Grunge; rock; alternative; alternative rock; 9...         4272894.0   
8  alternative rock; rock; alternative; Progressi...         4089612.0   
9  classic rock; rock; 80s; hard rock; glam rock;...         4023379.0   

   scrobbles_lastfm  ambiguous_artist  
0       360111850.0             False  
1       499548797.0             False  
2       293784041.0             False  
3       199248986.0             False  
4       199507511.0             False  
5       208722092.0             False  
6       238603850.0             False  
7       222303859.0             False  
8       344838631.0             False  
9       191711573.0             False  
In [40]:
track_id_playcount = listeningHistory.groupby('track_id').playcount.agg(['count', 'sum'])
complete_info = track_id_playcount.merge(musicInfo, on='track_id')

# Group by 'year' and sum 'playcount'
yearly_playcount = complete_info.groupby('year').sum(numeric_only=True)['sum']
# Creating a DataFrame with all years and initializing with 0
all_years = pd.DataFrame({'year': np.arange(1900, 2021), 'playcount': 0})

# Set 'year' as index in both DataFrames for the merge operation
all_years.set_index('year', inplace=True)
yearly_playcount = yearly_playcount.to_frame().rename(columns={'sum': 'playcount'})

# Merge dataframes
final_counts = all_years.merge(yearly_playcount, left_index=True, right_index=True, how='left')

# Fill NaN values with 0
final_counts.fillna(0, inplace=True)

# Sum the playcounts if there are multiple columns
final_counts['playcount'] = final_counts.sum(axis=1)
final_counts = final_counts.reset_index()
# Filter the data from 1950 to 1980
filtered_data_1 = final_counts[(final_counts['year'] >= 1950) & (final_counts['year'] <= 1980)]

# Create a subplot for the first time period (1950 to 1980)
plt.figure(figsize=(18, 12))
plt.subplot(2, 1, 1)
plt.plot(filtered_data_1['year'], filtered_data_1['playcount'], marker="o", linestyle="-")
plt.xlabel('Year')
plt.ylabel('Playcount')
plt.title('Year vs Playcount (1950 to 1980)')
plt.xticks(range(1950, 1981), rotation=90)
plt.grid(True)

# Filter the data from 1980 to 2020
filtered_data_2 = final_counts[(final_counts['year'] >= 1980) & (final_counts['year'] <= 2020)]

# Create a subplot for the second time period (1980 to 2020)
plt.subplot(2, 1, 2)
plt.plot(filtered_data_2['year'], filtered_data_2['playcount'], marker="o", linestyle="-")
plt.xlabel('Year')
plt.ylabel('Playcount')
plt.title('Year vs Playcount (1980 to 2020)')
plt.xticks(range(1980, 2021), rotation=90)
plt.grid(True)

plt.tight_layout()
plt.show()
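The zero-filling of years without any plays, done above with a merge against an all-zero frame, can also be written more compactly with `Series.reindex`. A minimal sketch on toy yearly totals (the numbers are made up):

```python
import pandas as pd

# Toy yearly playcount totals; years with no plays are simply absent.
yearly = pd.Series({1950: 10, 1952: 5, 1980: 7}, name="playcount")

# Reindex onto every year in the range, filling absent years with 0.
full = yearly.reindex(range(1950, 1981), fill_value=0)

print(full.head())
```

This keeps a single 'playcount' column throughout, so no later step is needed to sum duplicated columns.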
In [41]:
import matplotlib.pyplot as plt
merged_df = listeningHistory.merge(musicInfo, on='track_id')
filtered_post_war = merged_df[(merged_df['year'] >= 1950) & (merged_df['year'] <= 1980)]
filtered_digital = merged_df[(merged_df['year'] >= 1981) & (merged_df['year'] <= 2020)]
grouped_post_war = filtered_post_war.groupby('genre')['playcount'].agg('sum')
grouped_digital = filtered_digital.groupby('genre')['playcount'].agg('sum')
fig1, ax1 = plt.subplots()
ax1.bar(grouped_post_war.index, grouped_post_war.values)
ax1.set_xlabel('Genre')
ax1.set_ylabel('Total Playcount')
ax1.set_title('Total Playcount per Genre from 1950 to 1980 (Post-War Era)')
plt.xticks(rotation=90)
plt.show()
fig2, ax2 = plt.subplots()
ax2.bar(grouped_digital.index, grouped_digital.values)
ax2.set_xlabel('Genre')
ax2.set_ylabel('Total Playcount')
ax2.set_title('Total Playcount per Genre from 1981 to 2020 (Digital Age)')
plt.xticks(rotation=90)
plt.show()
In [42]:
import matplotlib.pyplot as plt

merged_df = listeningHistory.merge(musicInfo, on='track_id')
merged_df['genre'] = merged_df['genre'].str.lower()
filtered_df = merged_df[(merged_df['year'] >= 2000) & (merged_df['year'] <= 2015)]
genres = ['country', 'electronic', 'pop', 'rap', 'metal', 'rock']
filtered_df = filtered_df[filtered_df['genre'].isin(genres)]
grouped = filtered_df.groupby(['year', 'genre'])['playcount'].agg('sum').reset_index()
fig1, ax1 = plt.subplots()

for genre in genres:
    if genre == 'rock':
        continue
    data = grouped[grouped['genre'] == genre]
    ax1.plot(data['year'], data['playcount'], label=genre, marker="o", linestyle="-")

ax1.set_xlabel('Year')
ax1.set_ylabel('Total Playcount')
ax1.set_title('Total Playcount per Genre from 2000 to 2015 (Excluding Rock)')
ax1.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
ax1.grid(True)
plt.show()
fig2, ax2 = plt.subplots()
data = grouped[grouped['genre'] == 'rock']
ax2.plot(data['year'], data['playcount'], label='rock', marker="o", linestyle="-")

ax2.set_xlabel('Year')
ax2.set_ylabel('Total Playcount')
ax2.set_title('Total Playcount for Rock from 2000 to 2015')
ax2.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
ax2.grid(True)
In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Shuffle all rows (frac=1 keeps 100% of the data; use e.g. frac=0.1 to sample 10%)
sample_artists = artists.sample(frac=1)

# Convert the genres to a set for fast lookup
genre_set = set(musicInfo['genre'].str.lower())

tags_series = sample_artists['tags_lastfm'].str.split(';')
flattened_tags = tags_series.explode().str.strip()

# Get the unique tags that are also in genre_set (stored as a set for fast lookup)
matching_tags = set(flattened_tags[flattened_tags.isin(genre_set)].unique())

# Keep artists with at least one matching tag.
# Splitting on ';' leaves leading spaces, so each tag is stripped before the lookup.
selected_artists = sample_artists[sample_artists['tags_lastfm'].apply(lambda tags: any(tag.strip() in matching_tags for tag in str(tags).split(';')))].copy()

# Create a new column 'matching_tags' in selected_artists DataFrame that contains the list of matching tags for each row
selected_artists.loc[:, 'matching_tags'] = selected_artists['tags_lastfm'].apply(lambda tags: [tag.strip() for tag in str(tags).split(';') if tag.strip() in matching_tags])

# Explode 'matching_tags' so that each row represents one tag and its corresponding scrobblescore
exploded_artists = selected_artists.explode('matching_tags')

def normalize(series):
    return (series - series.min()) / (series.max() - series.min())

# Apply logarithmic transformation
exploded_artists['log_scrobbles'] = np.log(exploded_artists['scrobbles_lastfm'] + 1)  # Adding 1 to handle 0 values

# Apply normalization to log-transformed scrobbles
exploded_artists['normalized_log_scrobbles'] = normalize(exploded_artists['log_scrobbles'])

plt.figure(figsize=(12, 8))
tags = exploded_artists['matching_tags'].unique()

# Create a dictionary to hold normalized scrobbles for each tag
normalized_scrobbles_dict = {tag: exploded_artists.loc[exploded_artists['matching_tags']==tag, 'normalized_log_scrobbles'].values for tag in tags}

# Create a list of lists for boxplot data
boxplot_data = [normalized_scrobbles_dict[tag] for tag in tags]

plt.boxplot(boxplot_data, vert=True)
plt.xticks(range(1, len(tags)+1), tags, rotation=90)
plt.xlabel('Matching Tags ')
plt.ylabel('Normalized Log Scrobblescore')
plt.title('Normalized Log Scrobblescore Distribution for Each Tag')
plt.tight_layout()
plt.show()
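The box plot shows the spread per tag visually. To rank tags by variance directly, a sketch along the following lines may help. It uses toy numbers and hypothetical column names ('tag', 'scrobbles') and applies the same log and min-max normalization as the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the exploded artist/tag frame; values are illustrative.
scores = pd.DataFrame({
    "tag": ["rock", "rock", "rock", "pop", "pop", "pop"],
    "scrobbles": [0, 99, 9999, 9, 9, 99],
})

# log(1 + x) transform, then min-max normalization to the range [0, 1].
scores["log_scrobbles"] = np.log1p(scores["scrobbles"])
low = scores["log_scrobbles"].min()
high = scores["log_scrobbles"].max()
scores["normalized"] = (scores["log_scrobbles"] - low) / (high - low)

# Sample variance of the normalized score per tag, largest first.
variance_by_tag = (
    scores.groupby("tag")["normalized"].var().sort_values(ascending=False)
)

print(variance_by_tag)
```

Tags with very few artists give unstable variance estimates, so a minimum group size is probably worth applying on the real data.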
In [44]:
complete_info = track_id_playcount.merge(musicInfo, on='track_id')
complete_info = complete_info[(complete_info['year'] >= 2000) & (complete_info['year'] <= 2015)]
# Group by 'genre' and count unique 'track_id'
songs_per_genre = complete_info.groupby('genre')['track_id'].nunique()

# Convert to DataFrame and add a 'percentage' column
songs_per_genre = songs_per_genre.to_frame()
songs_per_genre.columns = ['count']
songs_per_genre['percentage'] = (songs_per_genre['count'] / songs_per_genre['count'].sum()) * 100
neglected_text = "Neglected Portions:\n" + "\n"

# Create labels for the pie chart
def create_label(row):
    global neglected_text
    if row['percentage'] < 2:
        neglected_text += f"{row.name} - {row['count']} ({row['percentage']:.1f}%)\n"
        return ''
    else:
        return f"{row.name} - {row['count']} ({row['percentage']:.1f}%)"

songs_per_genre['label'] = songs_per_genre.apply(create_label, axis=1)

# Plot
fig, ax = plt.subplots(figsize=(10, 8))
wedges, texts, autotexts = ax.pie(songs_per_genre['count'], labels=songs_per_genre['label'], autopct='%1.1f%%')
ax.set_title("Number of songs per genre between 2000 and 2015")
plt.text(1.2, 0.5, neglected_text, transform=ax.transAxes, fontsize=10)

# Hide labels and autotexts for neglected portions
for text, autotext in zip(texts, autotexts):
    if text.get_text() == "":
        text.set_visible(False)
        autotext.set_visible(False)

plt.show()
In [45]:
# Merge the datasets
complete_info = musicInfo.merge(listeningHistory, on='track_id')

# Filter data for the specified genres
filtered_data = complete_info[(complete_info['genre'].isin(['Metal', 'Pop', 'Electronic', 'Rock', 'RnB', 'Rap'])) &
                              (complete_info['year'] >= 1980) & 
                              (complete_info['year'] <= 2015)]

# Create subplots
fig, axs = plt.subplots(3, 2, figsize=(14, 15))  # Adjusted size and layout for 6 plots

# Create a list of genres and axs indices
genres = ['Metal', 'Pop', 'Electronic', 'Rock', 'RnB', 'Rap']  # Added 'Rap'
axs_indices = [(0,0), (0,1), (1,0), (1,1), (2,0), (2,1)]  # Added indices for the additional plots

# For each genre, filter data and plot histogram
for genre, ax_index in zip(genres, axs_indices):
    genre_data = filtered_data[filtered_data['genre'] == genre]
    axs[ax_index].hist(genre_data['year'], bins=16, color='blue', edgecolor='black', alpha=0.7)
    
    axs[ax_index].set_title(f'Yearly Distribution for {genre} (1980-2015)')
    axs[ax_index].set_xlabel('Year')
    axs[ax_index].set_ylabel('Number of Songs')

# Adjust the spacing between subplots
plt.tight_layout()
plt.show()
In [46]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Define a function to normalize a pandas Series
def normalize(series):
    return (series - series.min()) / (series.max() - series.min())

# Convert the genres to a set for fast lookup
genre_set = set(musicInfo['genre'].str.lower())

# Create a DataFrame that maps each artist to a genre
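# Tags are lower-cased before the lookup because genre_set holds lower-case names
artists_genres = pd.DataFrame(
    [
        (row['artist_lastfm'], tag.strip().lower())
        for _, row in artists.iterrows()
        for tag in str(row['tags_lastfm']).split(';')
        if tag.strip().lower() in genre_set
    ],
    columns=['artist', 'genre'],
)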

# Merge this DataFrame with artists on 'artist'
merged_artists = pd.merge(artists, artists_genres, left_on='artist_lastfm', right_on='artist')

# Apply logarithmic transformation and then normalize 'listeners_lastfm' and 'scrobbles_lastfm' columns
merged_artists['listeners_lastfm'] = normalize(np.log1p(merged_artists['listeners_lastfm']))
merged_artists['scrobbles_lastfm'] = normalize(np.log1p(merged_artists['scrobbles_lastfm']))
merged_artists['listeners_lastfm'] = merged_artists['listeners_lastfm'].fillna(merged_artists['listeners_lastfm'].mean())
merged_artists['scrobbles_lastfm'] = merged_artists['scrobbles_lastfm'].fillna(merged_artists['scrobbles_lastfm'].mean())

# Scatter Plot
plt.figure(figsize=(12, 6))
plt.scatter(merged_artists['listeners_lastfm'], merged_artists['scrobbles_lastfm'], alpha=0.5, s=1)
plt.title('Normalized Log Listeners vs. Normalized Log Scrobbles (Last.fm)')
plt.xlabel('Normalized Listeners_lastfm')
plt.ylabel('Normalized Scrobbles_lastfm')
plt.show()
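
The log-then-rescale step used in the cell above can be checked on a handful of numbers. The sketch below is illustrative only: `log_minmax` is a hypothetical helper, not part of the notebook, that combines `np.log1p` with the same min-max formula as `normalize` and adds a guard for a constant column, which would otherwise divide by zero. The input values are made up.

```python
import numpy as np
import pandas as pd

def log_minmax(series):
    """Apply log1p, then rescale to the [0, 1] range (hypothetical helper)."""
    logged = np.log1p(series)
    span = logged.max() - logged.min()
    if span == 0:
        # A constant column has no spread; return zeros instead of dividing by zero
        return pd.Series(0.0, index=series.index)
    return (logged - logged.min()) / span

# Made-up listener counts spanning two orders of magnitude
listeners = pd.Series([0, 9, 99])
scaled = log_minmax(listeners)
print(scaled)

# A constant column is mapped to zeros rather than NaN
constant = log_minmax(pd.Series([5, 5, 5]))
print(constant)
```

Because the logarithm compresses large values, the middle value lands near the centre of the range instead of close to zero, which is why the scatter plot is easier to read after the transformation.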
In [47]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
properties = ['danceability', 'energy', 'loudness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo']

# Group by genre and calculate mean for each property
grouped = musicInfo.groupby('genre')[properties].mean()
min_counts = 50
genres_to_include = musicInfo['genre'].value_counts()
genres_to_include = genres_to_include[genres_to_include > min_counts].index
grouped = grouped.loc[grouped.index.isin(genres_to_include)]
fixed_order = sorted(grouped.index)
colors = sns.color_palette("tab10", len(properties))

for idx, prop in enumerate(properties):
    plt.figure(figsize=(15,7))
    grouped.loc[fixed_order][prop].plot(kind='bar', color=colors[idx])
    plt.title(f'Average {prop.capitalize()} by Genre')
    plt.ylabel(prop.capitalize())
    plt.xlabel('Genre')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
In [48]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sum up playcount for each track
playcounts_summed = listeningHistory.groupby('track_id')['playcount'].sum().reset_index()

def normalize(series):
    return (series - series.min()) / (series.max() - series.min())

# Merge the two DataFrames
merged = pd.merge(musicInfo, playcounts_summed, on='track_id')
merged['log_playcount'] = np.log(merged['playcount'] + 1)
merged['log_playcount'] = normalize(merged['log_playcount'])

properties = ['danceability', 'energy', 'loudness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo']

# Normalize the properties
for prop in properties:
    merged[prop] = normalize(merged[prop])

sns.set_style("whitegrid")

# Loop through the properties and create the plots
for prop in properties:
    # Hexbin plot: jointplot creates its own figure, so no plt.figure() call is needed
    g = sns.jointplot(data=merged, x=prop, y='log_playcount', kind='hex', height=8,
                      gridsize=50, cmap='viridis', marginal_kws=dict(kde=True))
    g.set_axis_labels(prop.capitalize(), 'Normalized Log Total Playcount')
    g.fig.suptitle(f'{prop.capitalize()} vs Normalized Log Total Playcount', y=1.02)

    # Overlaid histograms of the normalized property and the normalized log playcount
    plt.figure(figsize=(10, 8))
    sns.histplot(data=merged, x='log_playcount', bins=30, kde=True, color='blue', label='Log Playcount')
    sns.histplot(data=merged, x=prop, bins=30, kde=True, color='red', label=prop.capitalize(), ax=plt.gca())
    plt.xlabel(f'Normalized Log Total Playcount and {prop.capitalize()}')
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {prop.capitalize()} and Normalized Log Total Playcount')
    plt.legend()
    plt.tight_layout()

    # Show and close this property's figures so they do not accumulate in memory
    plt.show()
    plt.close('all')
In [49]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Attach each artist's country (from the artists table) to the track-level data
merged_df = pd.merge(musicInfo, artists[['artist_mb', 'country_mb']], 
                     left_on='artist', right_on='artist_mb', how='inner')

# Drop unnecessary columns
merged_df.drop(columns=['artist_mb'], inplace=True)
top_countries = merged_df['country_mb'].value_counts().head(20).index.tolist()
print(top_countries)

filtered_df = merged_df[merged_df['country_mb'].isin(top_countries)]
properties = ['danceability', 'energy', 'loudness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo']

# Define the figure size and number of plots in the grid
fig, axes = plt.subplots(nrows=len(properties), figsize=(20, 8*len(properties))) 

for i, prop in enumerate(properties):
    # Fix the category order so every panel lists the countries the same way
    sns.violinplot(data=filtered_df, x='country_mb', y=prop, order=top_countries, palette='rainbow', ax=axes[i])
    axes[i].set_title(f"Distribution of {prop.capitalize()} by Country")
    # Rotate the existing tick labels instead of overwriting them
    axes[i].tick_params(axis='x', labelrotation=45)


plt.tight_layout()
plt.show()
['United States', 'United Kingdom', 'Germany', 'Sweden', 'Canada', 'France', 'Finland', 'Australia', 'Norway', 'Jamaica', 'Ireland', 'Poland', 'Brazil', 'Netherlands', 'Denmark', 'Italy', 'Switzerland', 'Iceland', 'Japan', 'Belgium']

Conclusion

Across this analysis of music data from different eras and regions, we examined music trends, listener preferences, and the factors associated with music popularity. Comparing total play counts for 1950-1980 with those for 1980-2020 showed clear shifts in music preferences, with the digital age seeing a surge in engagement with music that coincides with technological advancements.
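
The era comparison described above boils down to labelling each track by period and summing play counts. A minimal sketch follows, assuming a table with `year` and `playcount` columns like the merged data used in this notebook; `playcount_by_era` is a hypothetical helper, the numbers are made up, and assigning 1980 itself to the earlier era is an assumption, since the two stated ranges overlap at that year.

```python
import numpy as np
import pandas as pd

def playcount_by_era(tracks, split_year=1980):
    """Sum play counts per era; split_year itself counts as the earlier era (assumption)."""
    labelled = tracks.assign(
        era=np.where(tracks["year"] <= split_year, "1950-1980", "1981-2020")
    )
    return labelled.groupby("era")["playcount"].sum()

# Made-up rows for illustration only
toy_tracks = pd.DataFrame({
    "year": [1965, 1979, 1980, 1995, 2010],
    "playcount": [10, 20, 30, 40, 50],
})
era_totals = playcount_by_era(toy_tracks)
print(era_totals)
```

Making the boundary year an explicit parameter keeps the choice visible, so the totals can be recomputed if 1980 should belong to the later era instead.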

The evolution of music genres across different eras revealed dynamic changes in genre popularity, reflecting broader cultural shifts and technological influences. Listeners' genre preferences also demonstrated shifts over time, shaped by key events and advancements in the music landscape.

Notably, certain non-rock genres experienced substantial growth in play counts from 2000 to 2015, which shows how the landscape of music popularity has been changing. In addition, comparing audio features with play counts suggested possible relationships, with attributes such as danceability and energy appearing to be associated with higher play counts; these are associations rather than evidence of cause.
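
One way to put a number on the feature and play count relationship is a rank correlation. The sketch below is not the method used in the cells above, which rely on hexbin plots and histograms; it is a hedged illustration that computes Spearman correlations by ranking the columns and then applying pandas' default Pearson correlation. `feature_playcount_correlations` is a hypothetical helper, and the toy values are invented purely to show the mechanics.

```python
import numpy as np
import pandas as pd

def feature_playcount_correlations(tracks, features, playcount_col="playcount"):
    """Spearman rank correlation of each audio feature with log play count."""
    data = tracks[features].copy()
    data["log_playcount"] = np.log1p(tracks[playcount_col])
    # Spearman correlation equals the Pearson correlation of the ranks
    ranked = data.rank()
    corr = ranked[features].corrwith(ranked["log_playcount"])
    return corr.sort_values(ascending=False)

# Invented values: danceability rises with play count, energy falls
toy = pd.DataFrame({
    "danceability": [0.2, 0.4, 0.6, 0.8, 0.9],
    "energy": [0.9, 0.7, 0.5, 0.3, 0.1],
    "playcount": [3, 10, 40, 200, 1500],
})
corr = feature_playcount_correlations(toy, ["danceability", "energy"])
print(corr)
```

A rank correlation is used here because play counts are heavily skewed; it measures whether a feature tends to rise or fall with play count without assuming a straight-line relationship.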

Examining the relationship between listeners and scrobbles on Last.fm emphasized the importance of audience engagement to an artist's success. Factors such as the total amount of music released, the number of releases per year, and the attributes of the music genre all played a role in artist popularity and play counts.
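
Audience engagement can be made concrete as scrobbles per listener, that is, how many plays an average listener contributes. This ratio is not computed in the notebook; the sketch below is an assumption about one reasonable way to express it, using the `listeners_lastfm` and `scrobbles_lastfm` column names from the artists table and made-up values. `scrobbles_per_listener` is a hypothetical helper.

```python
import pandas as pd

def scrobbles_per_listener(artists):
    """Average plays per listener; artists with no listeners yield NaN."""
    # Mask zero listener counts so the division does not produce infinities
    listeners = artists["listeners_lastfm"].where(artists["listeners_lastfm"] > 0)
    return artists["scrobbles_lastfm"] / listeners

# Made-up artists for illustration only
toy_artists = pd.DataFrame({
    "artist_lastfm": ["A", "B", "C"],
    "listeners_lastfm": [100, 200, 0],
    "scrobbles_lastfm": [1000, 400, 50],
})
ratio = scrobbles_per_listener(toy_artists)
print(ratio)
```

Two artists with the same number of listeners can differ widely on this ratio, which separates a broad but casual audience from a smaller audience that replays the music often.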

Our study also highlighted the significant influence of cultural context on music attributes found in popular tracks. Genre preferences, thematic elements, and stylistic choices were shown to be influenced by specific cultural backgrounds, contributing to music popularity within distinct regions.

In conclusion, this thorough analysis offers valuable insights for music platforms, artists, and stakeholders to optimize content, engage with audiences effectively, and make informed decisions in the ever-evolving music industry. By understanding these patterns and trends, the music industry can continue to thrive and adapt to the changing preferences of global audiences.